Group Number: 28
Group members and Student Number:
This project is the work of the members of Group 28, and every group member has adhered to St. Clair College’s Academic Integrity Policies in completing it.
R Version: Version 4.0.5
Rstudio Version: Version 1.4.1103
This data set on the Transit Costs Project was created by Thomas Mock.
The following link leads to the data set page on GitHub:
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-01-05/readme.md
This data set contains information on transit projects in more than 50 countries since the late 1990s. It includes the total cost of each project, the country and city where it was built, the line length and tunnel length, the start year and expected end year, the PPP rate, and more. These data can be used to analyze how much each country spent on a transit line and what the cost per kilometer was, and to compare countries to find out whether what they spend on a project is reasonable given their PPP rates.
The following 3 packages are used in this project:
1. tidyverse - this package contains the package ggplot2, which is used for visualization.
2. tidytuesdayR - this package provides access to the ‘Transit cost project’ data set.
3. plotly - this package is used to make plots interactive.
The library() function is used to load packages.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytuesdayR)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The following code imports the data set and stores it in the variable transit_cost.
transit_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-05/transit_cost.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## e = col_double(),
## rr = col_double(),
## length = col_double(),
## tunnel = col_double(),
## stations = col_double(),
## cost = col_double(),
## year = col_double(),
## ppp_rate = col_double(),
## cost_km_millions = col_double()
## )
## i Use `spec()` for the full column specifications.
The glimpse() function gives a general idea of the data set, such as its dimensions, the type of each column, and the first few entries in each column.
glimpse(transit_cost)
## Rows: 544
## Columns: 20
## $ e <dbl> 7136, 7137, 7138, 7139, 7144, 7145, 7146, 7147, 7152,~
## $ country <chr> "CA", "CA", "CA", "CA", "CA", "NL", "CA", "US", "US",~
## $ city <chr> "Vancouver", "Toronto", "Toronto", "Toronto", "Toront~
## $ line <chr> "Broadway", "Vaughan", "Scarborough", "Ontario", "Yon~
## $ start_year <chr> "2020", "2009", "2020", "2020", "2020", "2003", "2020~
## $ end_year <chr> "2025", "2017", "2030", "2030", "2030", "2018", "2026~
## $ rr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ length <dbl> 5.7, 8.6, 7.8, 15.5, 7.4, 9.7, 5.8, 5.1, 4.2, 4.2, 6.~
## $ tunnel_per <chr> "87.72%", "100.00%", "100.00%", "57.00%", "100.00%", ~
## $ tunnel <dbl> 5.0, 8.6, 7.8, 8.8, 7.4, 7.1, 5.8, 5.1, 4.2, 4.2, 6.3~
## $ stations <dbl> 6, 6, 3, 15, 6, 8, 5, 2, 2, 2, 3, 3, 4, 7, 13, 4, 4, ~
## $ source1 <chr> "Plan", "Media", "Wiki", "Plan", "Plan", "Wiki", "Med~
## $ cost <dbl> 2830, 3200, 5500, 8573, 5600, 3100, 4500, 1756, 3600,~
## $ currency <chr> "CAD", "CAD", "CAD", "CAD", "CAD", "EUR", "CAD", "USD~
## $ year <dbl> 2018, 2013, 2018, 2019, 2020, 2009, 2018, 2012, 2023,~
## $ ppp_rate <dbl> 0.840, 0.810, 0.840, 0.840, 0.840, 1.300, 0.840, 1.00~
## $ real_cost <chr> "2377.2", "2592", "4620", "7201.32", "4704", "4030", ~
## $ cost_km_millions <dbl> 417.05263, 301.39535, 592.30769, 464.60129, 635.67568~
## $ source2 <chr> "Media", "Media", "Media", "Plan", "Media", "Media", ~
## $ reference <chr> "https://www.translink.ca/Plans-and-Projects/Rapid-Tr~
# Project duration in years; non-numeric year entries are coerced to NA
time <- as.integer(transit_cost$end_year) - as.integer(transit_cost$start_year)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
ggplot(data = transit_cost)+
geom_freqpoly(mapping = aes(x = time))+
labs(title = "Time taken for project completion",
x = "Time in years",
y = "Projects",
caption = " Data source: https://transitcosts.com/")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 82 rows containing non-finite values (stat_bin).
The plot shows the number of projects against the time taken to complete them, in years. Approximately 5 projects were completed within the first 2 years. The count then rises to about 90 projects at just under 5 years and to more than 100 projects at just over 5 years. About 50 projects took nearly 7 years. Beyond 7 years the number of projects generally declines as completion time increases toward 10 years, apart from around 20 projects that finished in 10 to 11 years. At the far end, completion times keep increasing while the number of projects keeps falling.
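The coercion warnings above come from year entries that are not plain numbers. A minimal sketch of computing the duration while dropping such rows explicitly is given below. The toy data frame and its "4 years" and "X" entries are hypothetical stand-ins for transit_cost, not values taken from the data set.

```r
library(dplyr)

# Hypothetical toy rows standing in for transit_cost
toy <- data.frame(
  start_year = c("2020", "2009", "4 years", "2016"),
  end_year   = c("2025", "2017", "2020", "X")
)

durations <- toy %>%
  mutate(
    # Non-numeric strings become NA; the warning is silenced on purpose here
    start = suppressWarnings(as.integer(start_year)),
    end   = suppressWarnings(as.integer(end_year)),
    time  = end - start
  ) %>%
  filter(!is.na(time))  # keep only rows where both years could be parsed

durations$time
```

With the durations stored as a column, passing `binwidth = 1` to geom_freqpoly() would give one bin per year instead of the default 30 bins, which may make the counts easier to read.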
ggplot(transit_cost)+
geom_histogram(aes(x = as.double(real_cost)), color = "black", fill = "orange")+
labs(title = "Cost distribution of all the projects",
x = "Real cost",
y = "Projects",
caption = "Cost in Millions of USD")
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (stat_bin).
This histogram shows the construction cost, in millions of USD, of all the transit line projects. The graph clearly shows that the majority of projects cost less than 12000 million USD. Very few projects cost between 12000 and 50000 million USD, and only one project cost more than 75000 million USD.
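Because real_cost is stored as text, it has to be converted before any numeric summary. The sketch below shows one way to convert it once, count the entries that fail to parse, and compute the share of projects under a threshold. The first four values are the ones shown in the glimpse() output; the "MAX" entry and the 5000 threshold are assumptions for illustration only.

```r
# First four values from the glimpse() output, plus hypothetical bad entries
real_cost_chr <- c("2377.2", "2592", "4620", "7201.32", "MAX", NA)

# Convert once; entries that are not numbers become NA
real_cost <- suppressWarnings(as.numeric(real_cost_chr))

# How many entries could not be used
n_dropped <- sum(is.na(real_cost))

# Share of parsed projects costing less than 5000 million USD (assumed threshold)
share_below <- mean(real_cost < 5000, na.rm = TRUE)

n_dropped
share_below
```

The same converted vector could be passed to geom_histogram() together with an explicit `binwidth`, which would avoid the repeated coercion warnings and the default of 30 bins.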
ggplot(transit_cost)+
geom_bar(aes(rr), color = "black", fill = "blue")+
labs(title = "Rail Road or no Rail Road",
x = "Rail Road",
caption = "0 indicates no Rail road and 1 indicates Rail road")
## Warning: Removed 8 rows containing non-finite values (stat_count).
This bar graph shows how many projects were railroad projects and how many were other projects, where 0 indicates no railroad and 1 indicates railroad. Only around 70 transit lines were railroad projects, while over 500 projects were of other types. This suggests that in recent years very little work has been done on railroads and more importance has been given to other types of transit systems.
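The bar heights can also be read as exact numbers with count(). The sketch below uses a small hypothetical rr column rather than the real data, and the labels "Railroad" and "Other" are assumed names chosen for readability.

```r
library(dplyr)

# Hypothetical rr values; 1 = railroad, 0 = other, NA = not recorded
toy <- data.frame(rr = c(0, 0, 1, 0, NA, 1, 0))

rr_counts <- toy %>%
  mutate(type = if_else(rr == 1, "Railroad", "Other")) %>%  # NA stays NA
  count(type)

rr_counts
```

Running the same count on transit_cost would give the exact number of projects in each group, including the rows that geom_bar() reports as removed.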
ggplot(transit_cost)+
geom_bar(aes(source1))+
labs(title = "Source of each country's data",
x = "Data Source",
y = "Projects",
caption = "NA = Data not available")+
coord_flip()
This bar chart shows the sources of the project data. The largest share of the data comes from Plan, which is the longest bar in the graph. Media (about 90) and Wiki (about 70) follow Plan. The counts for Orascom, Bechtel BACS and Fast consortium are small and similar, at 5, 1 and 4 respectively. The remaining entries have no recorded source (NA).
transit_cost %>% filter(country %in% c('CN', 'IN')) %>%
ggplot()+
geom_boxplot(aes(y = city, x = ppp_rate, color = country, fill = country))+
labs(title = "PPP rates of all the Chinese and Indian cities",
y = "Chinese and Indian cities",
x = "PPP Rate",
caption = "PPP = Purchasing Power Parity")+
theme(axis.text.y = element_text(size = 8))
This box plot displays the purchasing power parity (PPP) rates of Chinese and Indian cities. Chinese cities, shown in red, have higher PPP rates than Indian cities, shown in blue. Shanghai has the highest rate in China, at more than 0.3. In contrast, Bangalore has the lowest rate in India. Beijing, Changchun, Chongqing, Nanjing and Wuhan range between 0.2 and 0.3. Indian cities such as Chennai, Gurgaon and Mumbai have the lowest PPP rates among all the cities from these two countries.
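When a filter should keep rows from several countries, the choice between `==` and `%in%` matters. With `==` and a two-element vector, R recycles the shorter vector, so each row is compared against only one of the two codes and some matching rows are silently dropped. The sketch below illustrates this on a hypothetical vector of country codes.

```r
# Hypothetical country codes for six projects
country <- c("CN", "IN", "IN", "CN", "US", "IN")

# `==` recycles c("CN", "IN") to CN, IN, CN, IN, CN, IN and compares pairwise
eq_match <- country == c("CN", "IN")

# `%in%` tests whether each element is a member of the set
in_match <- country %in% c("CN", "IN")

sum(eq_match)  # rows kept with ==
sum(in_match)  # rows kept with %in%
```

Here `%in%` keeps every Chinese and Indian row, while `==` keeps only the rows that happen to line up with the recycled pattern.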
ggplot(data = transit_cost)+
geom_col(mapping = aes(x = country, y = as.double(real_cost)))+
theme(axis.text.y = element_text(size = 5))+
labs(title = "Construction cost in each country",
y = "Real cost of construction",
x = "Country",
caption = "cost in Millions of USD")+
coord_flip()
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning: Removed 7 rows containing missing values (position_stack).
The graph displays the total amount of money spent on the construction of transit lines in each country. Spending in China was roughly four times that of India. In comparison, Portugal, Belgium and Norway each spent around 1000 million USD. China, India and South Africa are the top three countries by expenditure. About 20 nations spent a negligible amount on construction compared with China.
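geom_col() stacks one segment per project, so each bar's height is the sum of real_cost within a country. A minimal sketch of computing those totals directly is shown below. The toy rows are illustrative values, not the full data set, and the column names total_cost and projects are assumed.

```r
library(dplyr)

# Illustrative rows; real_cost is text, as in the imported data
toy <- data.frame(
  country   = c("CA", "CA", "NL", "US", "US"),
  real_cost = c("2377.2", "2592", "4030", "1756", NA)
)

totals <- toy %>%
  mutate(real_cost = suppressWarnings(as.numeric(real_cost))) %>%
  group_by(country) %>%
  summarise(
    total_cost = sum(real_cost, na.rm = TRUE),  # ignore unparsed costs
    projects   = n()
  ) %>%
  arrange(desc(total_cost))

totals
```

A table like this makes it possible to state rankings and ratios between countries from exact totals rather than from bar lengths.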
transit_cost %>% filter(country=="CN") %>%
ggplot()+
geom_col(aes(x = as.factor(city), y = as.double(real_cost)), color = "Black", fill = "red")+
labs(title = "Construction cost in Chinese cities",
x = "Cities in China",
y = "Real cost of construction",
caption = "Cost in Millions of USD")+
coord_flip()
The plot shows the real cost of construction in different cities of China. At first glance, it is clear that the cost of construction in Shanghai is higher than 150000 million USD, the largest amount among all cities. Shanghai is followed by Beijing and Wuhan. The smallest investment, nearly 10000 million USD, was in Lanzhou. The real cost of construction was below 50000 million USD in 23 Chinese cities, and in 13 of those 23 cities it was below 25000 million USD. The values recorded for Xiamen, Guiyang and Dongguan are nearly the same. Shenzhen and Guangzhou each account for around 60000 million USD.
transit_cost %>% filter(country == 'IN') %>%
ggplot() +
geom_point(mapping = aes(x = length, y = (as.integer(end_year)-as.integer(start_year))), color = "red") +
facet_grid(source1 ~ city)+
theme(axis.text.x = element_text(angle = 90))+
labs(title = "Time taken to construct different length lines in Indian cities",
x = "Line length",
y = "Construction time",
caption = "Time in years and Length in Kilometers")+
coord_flip()
## Warning: Removed 2 rows containing missing values (geom_point).
The grid displays the projects in some Indian cities by data source, along with the length of each line in km and the time taken to complete it. Most of the project data for Mumbai came from Plan; all of these projects were less than 50 km long and were completed within 7.5 years, except one that took around 10 years. Ahmedabad and Delhi each had only one project, with data from Wiki; their lengths were nearly 50 km and 20 km respectively, and the Ahmedabad project took more than 7.5 years to complete. Bangalore had 2 projects, one sourced from Trade and the other from Wiki, with line lengths of 75 km and just under 50 km respectively. Gurgaon, Hyderabad, Kochi and Nagpur had only one project each. The source was the same for Hyderabad and Nagpur, but their line lengths differed by almost 25 km. The projects in Gurgaon and Kochi each took more than 2.5 years to complete.
interactive_plot <- transit_cost %>% filter(country=='IN') %>%
ggplot()+
geom_point(aes(x = cost_km_millions, y = line, color = city, size = cost), alpha = 0.5)+
labs(title = "Details of transit lines in Indian Cities",
x = "Cost per Kilometers in millions",
y = "Transit line")
ggplotly(interactive_plot)
The interactive plot shows the transit lines in different Indian cities along with their cost per kilometer in millions of USD. Line 3, being constructed in Mumbai, has the highest construction cost per kilometer, at around 450 million USD, but the Phase 2 transit line in Chennai has the highest overall construction cost in the country. The Airport Express line in Hyderabad appears to have the lowest construction cost per kilometer. The majority of the construction projects are in Mumbai, followed by Delhi and Chennai. Line 3 of Mumbai, Phase 2 of Chennai and Line 11 of Mumbai are the three projects with the highest construction costs per kilometer.
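The cost per kilometer on the x-axis appears to be the real cost divided by the line length. The sketch below checks this against the first three rows shown in the glimpse() output; the variable names are assumed, and the check covers only those three rows.

```r
# First three rows from the glimpse() output
real_cost <- c(2377.2, 2592, 4620)   # millions of USD
length_km <- c(5.7, 8.6, 7.8)        # line length in km

# Cost per kilometer in millions of USD
cost_km <- real_cost / length_km

# Values listed for cost_km_millions in the same three rows
listed <- c(417.05263, 301.39535, 592.30769)

round(cost_km, 5)
all(abs(cost_km - listed) < 1e-4)
```

If the relationship holds across the data set, a long line can have a high total cost while still ranking low on cost per kilometer, which is consistent with the total and per-kilometer rankings differing.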
Data visualization is the representation of data or information in a graph, chart or other visual format. It plays a significant role in understanding a data set in several ways.
Data visualization is important for communicating the key aspects of a data set because it presents information in the form of maps and graphs, which makes it easy to identify trends, patterns and outliers within large data sets. A picture is worth a thousand words, and data visualization does the same job by conveying more information in a single image without overloading the human brain.
Data visualization is a method of presenting information in a way that is easy to interpret. As analysts, we should not hide or change the data while working on it, and we should always present the truth logically and in an easily understandable way. An analyst should not bring in any personal opinion or bias while working on the data.
We believe that 3 is the maximum number of variables that can be represented successfully in a single visualization. It is easy to use more than 3 variables, but that packs the plot with information and can make the visualization very complicated. The main function of a visualization is to convey information simply and to make it understandable at a single glance. Putting more information into a single visualization makes it harder to understand, and it becomes easy to miss important information, which can be disastrous for a business.
The final four questions were discussed by all the group members, and the writing was also done by all the group members.
Ibrahim Hussain (#0773950)
Responsible for plotting interactive plot and its explanation.
Manishpreet Kaur (#0784729)
Responsible for plotting two plots displaying the distribution of a single continuous variable and their explanation.
Neha Neha (#0788774)
Responsible for plotting two plots displaying information that shows a relationship between two variables and their explanations.
Vinny Sachdeva (#0755811)
Responsible for plotting two plots displaying information about a single categorical variable and their explanations.
Rakesh Singh (#0775942)
Responsible for plotting one plot displaying information about both a continuous variable and a categorical variable and one plot using faceting and also their explanations.